Harnessing the Hidden Potential of Databases
Exploring the Why and How of Database Utilization
in Research and Learning
November 2, 2023
Slides URL
The slides contain links, and so here is
a link to this exact slide deck
Talk Structure
Background: How did I get here?
Introduction
Past research
What are databases?
Basics
Compare and contrast to spreadsheets
Installation and Use
Options: from simple to complex
Interfaces
Examples
Summary
Background: How did I get here?
Grad Research
My graduate research centered around refining our estimates of the
mass distribution of the near-Earth asteroid population
Too small to be detected by most means, my detection screen was the
shadowed portion of the moon
Involved many nights of observations
As commonly happens, the amount of data snuck up on me
Tracking Observations
Willamette Research
Cloudy skies makes lunar observations impractical
What if Earth’s skies became the “detection screen” instead?
Any part of the sky being clear could allow for observations
Need an automated recording system
Detection analysis done onboard with a microcomputer
The data load still crept up on me! Despite more
efforts!
Currently
Upon joining the SCIS faculty, I took on all the database and data
management courses
I’ve taught the undergrad DATA 351 course twice
I’ve taught the graduate DATA 503 course three times, soon to be 5
this spring
I continue to teach the introductory programming courses in Python,
and run the Section Leader program
I teach courses that lie at the intersection of the natural sciences
and data science, currently Data in the Cosmos
A Family
There are actually a variety of types of databases, each with their
tradeoffs
Relational databases
Document databases
Graph databases
For those coming from familiarity with spreadsheets, relational
databases will probably feel the more intuitive
Relational databases also happen to still be incredibly popular in
industry
Why Relational Tables?
Suppose you wanted to keep track of your friends’ birthdays
This could easily be accomplished in a single table
Frank
Stein
4/2/2000
Tessa
Loch
8/23/2003
Bobby
Wolf
12/14/2005
Complicating Things
What if you also wanted to track what classes each friend was
taking?
Could imagine each friend taking 2-3 classes, some of which
overlap
This complicates things more than might be expected, since to still
store this in a single simple table now requires data duplication
Classy Friends
Frank
Stein
4/2/2000
CS151
MWF
1:00pm
Frank
Stein
4/2/2000
MATH256
MWF
9:00am
Tessa
Loch
8/23/2003
CS151
MWF
1:00pm
Tessa
Loch
8/23/2003
IDS236
TTh
1:00pm
Tessa
Loch
8/23/2003
HIST123
MWF
12:00pm
Bobby
Wolf
12/14/2005
IDS236
TTh
1:00pm
Bobby
Wolf
12/14/2005
MATH256
MWF
9:00am
Data Duplication = BAD
Relational Tables
One solution to this issue is realizing that we are trying to
actually keep track of two things: friends and classes.
So we break things up into two tables, and then create
relationships between them
This is the core of what occurs in a relational
database !
Frank
Stein
4/2/2000
Tessa
Loch
8/23/2003
Bobby
Wolf
12/14/2005
CS151
MWF
1:00pm
MATH256
MWF
9:00am
IDS236
TTh
1:00pm
HIST123
MWF
12:00pm
PHYS221
MWF
10:00pm
Viewing Relational Databases
In general, you’d probably use a third table to represent all the
linkages
Unique id keys are used to connect the different tables
Relational Database Schema
Differences from Spreadsheets
Databases have a much more rigid structure
You define up front what columns you want, and the
data type associated with that column
Each row represents a record or observation, and data is entered one
row at a time.
Calculations are generally done through a query, not stored in cells
within the table
Eliminates any confusion between the original data and derived
values
Only the data matters. There is no coloring or other formatting.
Visualizations are handled by external tools
Benefits over spreadsheets
Data Integrity
Columns enforce types, calculations can not be mistaken for data,
relationships between tables enforced
More effective at eliminating duplicate information, and the
possible issues that arise
Far better performance for huge numbers of records
Database servers are build for multi-user access, facilitating
sharing of data and with full user access control and permissions
Excellent support from all popular coding languages for accessing
database information
Separates where the data is stored from where it is analyzed
SQL
Pronounced “ESS-QUE-ELL”, or, historically “sequel”, SQL is the
dominant language for interacting with relational databases
A declarative language, so fairly straightforward to learn
Primarily used for managing the database, adding records to the
database, or extracting specific information from the database
SELECT column_names
FROM table_name
WHERE condition_is_true
ORDER BY column_name;
The Spice of Life
Relational databases come in a surprisingly large variety of forms
and formats
Many are free and/or open source
All use a core set of standard SQL, and then start to branch from
there
Extra functions that are made available
Different data types available for columns
Options can be overwhelming, but we’ll cover a few common cases
Easiest: SQLite
SQLite stores each database in a single file
This can make sharing or moving around the database very simple, as
people are well acquainted with working with files
Installation is simple, or even unnecessary, as the tools for
creating, managing, and accessing the database are just bundled in a zip
file
Documentation on website is good
Many programming languages can interface with SQLite databases
Simple nature and easy installation make it a great starting
point
Local Server
Most other database installations operate on a server model
The server operates as the central hub that clients can connect to
to interact with the database
Nothing special about the server aside from it needing to be powered
on in order for the data to be accessible
Could install on a personal system or on a lab or other public
system if you want easy access from multiple individuals
The big players here include PostgreSQL, MySQL, SQL Server, and
Oracle
I’ve been teaching PostgreSQL in class and been very happy
Installation is more involved, but still not overly complex through
EDB here
or through Postgres.app here for
Mac users
Remote Server
If a local server install is too intimidating, or you don’t have a
good computer to use as a server, there is no shortage of online
deployment options as well, though they come with a cost
The tech giants have offerings through Azure, AWS, or Google Cloud
Platform, but I’ve found them to be overly complicated for most
individual or small group needs
My favorite recommendation for the past year has been Railway.app , though it is not quite as
free now as it once was
Digital
Ocean also has super simple setups, though it runs a bit more
expensive
SQL Client
For any server-based database (though they tend to work with SQLite
as well), you need a client to connect to, view, and run SQL code
against the database
The list of possible clients is massive and can vary between
operating systems
Many are capable of connecting to most of the different database
server options
My favorite at the moment is Beekeeper
Studio (Community Edition)
Another popular client that will connect to most anything is DBeaver
Most any client will let you browse table, make edits directly in
tables, or run SQL to otherwise manage the database
NocoDB
While most SQL clients will show a grid view of a table and may
allow edits, they aren’t really intended for that to be the main method
of interaction
If you want to leverage the benefits of a database while still
leveraging the comfort of a spreadsheet, some options exist
Airtable is a popular option
that streamlines working with a database for non-coding folks. Free
small plans exist.
NocoDB is an open-source offering
that I came across recently which aims to mimic much of what has made
Airtable so popular
Either option can connect to an internal database or to an external,
already existing database
Both can significantly simplify allowing multiple users to edit the
database, or provide read-only views to certain viewers
Both also allow the creation of basic forms to streamline entering
new information into the database, which otherwise requires some coding
knowledge
Polling
For years now I’ve used an SQLite database to record student
submissions to my in-class polls
SQLite database and the web server that runs the polling form live
on a Raspberry Pi microcomputer
Later I go back and query the responses to generate my participation
scores
Over 18k submissions stored
Grades
I generate my own grade reports for my classes, and have done so for
years
Switching to using a database on the backend has made this
so much easier
Also allows me total control over implementing other programs that
can interface with this system
Grading rubrics for helping me grade
Grade reporting systems
Interfacing with submission systems (GitHub Classroom)
Pros/Cons
More robust toward data integrity and organization
More performant as the amount of data increases
Able to be easily connected to from a host of other software or
programs
Excellent for storage of accruing data for later analysis
Do one thing, and do it very well
Are incredibly flexible in being able to store, analyze, and
visualize data all at once
Are more visual to use, requiring no knowledge of a coding language
or SQL
Generally expected to store mainly numbers or text, and can struggle
with other data formats
Perfect for quick analysis of relatively small or uncomplicated
datasets
Wrapping Up
Thanks to all of you for attending!
Hopefully I’ve at least put the thought in your head that in certain
cases databases can be worth the shift.
They have massively impacted the way I operate (for the better)
these past several years.
Please know that I’m always happy to help advise or assist anyone
with database related questions.